The Best 161 Multimodal Fusion Tools in 2025
CodeBERT Base
CodeBERT is a pre-trained model for programming languages and natural languages, based on the RoBERTa architecture, supporting tasks such as code search and code-to-documentation generation; a minimal embedding sketch follows this entry.
Multimodal Fusion
microsoft · 1.6M downloads · 248 likes
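Since CodeBERT follows the standard RoBERTa interface in Hugging Face Transformers, extracting a snippet-level embedding can be sketched as below; the example snippet and the CLS-pooling choice are illustrative assumptions, not the official code-search recipe.

# Minimal sketch: embed a code snippet with microsoft/codebert-base
# (assumes torch and transformers are installed; pooling choice is illustrative)
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def add(a, b): return a + b"  # hypothetical example snippet
inputs = tokenizer(code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# Take the first-token (CLS-position) hidden state as a rough snippet embedding
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # torch.Size([1, 768])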
Llama 4 Scout 17B 16E Instruct
Other
Llama 4 Scout is a multimodal AI model developed by Meta, featuring a mixture-of-experts architecture and supporting text-and-image interaction in 12 languages, with 17B active parameters and 109B total parameters; a rough loading sketch follows this entry.
Multimodal Fusion
Transformers · Supports Multiple Languages
meta-llama · 817.62k downloads · 844 likes
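Recent Hugging Face Transformers releases expose a Llama4ForConditionalGeneration class for these checkpoints; a rough text-only chat sketch under that assumption is given below (gated model access, a recent transformers version, and enough GPU memory for the 109B-total-parameter weights are all assumed).

# Rough sketch: text-only chat with Llama 4 Scout via Transformers
# (assumes a transformers release with Llama 4 support, accepted gated access,
#  and hardware able to hold the full mixture-of-experts checkpoint)
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Summarize what a mixture-of-experts layer does."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))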
UniXcoder Base
Apache-2.0
UniXcoder is a unified multimodal pretrained model that leverages multimodal data such as code comments and abstract syntax trees for pretraining code representations.
Multimodal Fusion
Transformers · English
microsoft · 347.45k downloads · 51 likes
TITAN
TITAN is a multimodal whole slide foundation model pre-trained through visual self-supervised learning and vision-language alignment for pathology image analysis.
Multimodal Fusion
Safetensors · English
MahmoodLab · 213.39k downloads · 37 likes
Qwen2.5 Omni 7B
Other
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities such as text, images, audio, and video, and generating text and natural speech responses in a streaming manner.
Multimodal Fusion
Transformers · English
Qwen · 206.20k downloads · 1,522 likes
MiniCPM-o 2.6
MiniCPM-o 2.6 is a GPT-4o-level multimodal large model that runs on mobile devices, supporting vision, voice, and live-stream processing.
Multimodal Fusion
Transformers · Other
openbmb · 178.38k downloads · 1,117 likes
Llama 4 Scout 17B 16E Instruct
Other
Llama 4 Scout is a 17B-active-parameter, 16-expert multimodal AI model from Meta, supporting 12 languages and image understanding with strong reported performance.
Multimodal Fusion
Transformers · Supports Multiple Languages
chutesai · 173.52k downloads · 2 likes
Qwen2.5 Omni 3B
Other
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities including text, images, audio, and video, while synchronously generating text and natural speech responses in a streaming manner.
Multimodal Fusion
Transformers · English
Qwen · 48.07k downloads · 219 likes
One Align
MIT
Q-Align is a multi-task visual assessment model focusing on Image Quality Assessment (IQA), Image Aesthetic Assessment (IAA), and Video Quality Assessment (VQA), published at ICML 2024.
Multimodal Fusion
Transformers
q-future · 39.48k downloads · 25 likes
BiomedVLP BioViL-T
MIT
BioViL-T is a vision-language model focused on analyzing chest X-rays and radiology reports, enhancing performance through temporal multimodal pretraining.
Multimodal Fusion
Transformers · English
microsoft · 26.39k downloads · 35 likes
Chameleon 7b
Other
Meta Chameleon is a mixed-modal, early-fusion foundation model developed by FAIR, supporting interleaved processing of images and text; a rough prompting sketch follows this entry.
Multimodal Fusion
Transformers
facebook · 20.97k downloads · 179 likes
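Transformers ships dedicated Chameleon classes, so an interleaved image-plus-text prompt can be sketched roughly as follows; the image URL and question are placeholders, and preprocessing details may differ from the model card.

# Rough sketch: prompting facebook/chameleon-7b with an image and a question
# (assumes a transformers release with Chameleon support and a CUDA GPU;
#  the image URL and prompt are placeholder examples)
import requests
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "What is shown in this image?<image>"  # <image> marks where the image is fused in

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))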
LLM2CLIP Llama 3 8B Instruct CC Finetuned
Apache-2.0
LLM2CLIP is an innovative approach that enhances CLIP's cross-modal capabilities through large language models, significantly improving the discriminative power of visual and text representations.
Multimodal Fusion
microsoft · 18.16k downloads · 35 likes
UniXcoder Base Nine
Apache-2.0
UniXcoder is a unified multimodal pretraining model that leverages multimodal data (such as code comments and abstract syntax trees) to pretrain code representations.
Multimodal Fusion
Transformers · English
microsoft · 17.35k downloads · 19 likes
Llama Guard 4 12B
Other
Llama Guard 4 is a native multimodal safety classifier with 12 billion parameters, jointly trained on text and multiple images for content safety evaluation of large language model inputs and outputs.
Multimodal Fusion
Transformers · English
meta-llama · 16.52k downloads · 30 likes
Spatialvla 4b 224 Pt
MIT
SpatialVLA is a spatially enhanced vision-language-action model trained on 1.1 million real robot manipulation episodes, focused on robot control tasks.
Multimodal Fusion
Transformers · English
IPEC-COMMUNITY · 13.06k downloads · 5 likes
Pi0
Apache-2.0
Pi0 is a general-purpose vision-language-action flow model for robot control tasks.
Multimodal Fusion
lerobot · 11.84k downloads · 230 likes
ColNomic Embed Multimodal 7B
Apache-2.0
ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval, with multilingual support and unified text-image encoding; a minimal scoring sketch follows this entry.
Multimodal Fusion · Supports Multiple Languages
nomic-ai · 7,909 downloads · 45 likes
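ColNomic-style multi-vector retrieval scores a query against a document page by late interaction: each query-token embedding is matched to its most similar page-patch embedding and the maxima are summed. The self-contained PyTorch sketch below shows only that scoring step, with random placeholder tensors standing in for the model's real outputs.

# Illustrative late-interaction (MaxSim) scoring for multi-vector embeddings,
# the retrieval style used by ColBERT/ColPali-like models such as ColNomic.
# Random tensors stand in for real query-token and page-patch embeddings.
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_patches, dim)."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    sim = q @ d.T                        # cosine similarity of every token/patch pair
    return sim.max(dim=-1).values.sum()  # best patch per query token, summed

query = torch.randn(16, 128)                         # e.g. 16 query-token vectors
pages = [torch.randn(1024, 128) for _ in range(3)]   # 3 candidate document pages

scores = torch.stack([maxsim_score(query, p) for p in pages])
print(scores.argmax().item())  # index of the best-matching page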
Llama 4 Scout 17B 16E Linearized Bnb Nf4 Bf16
Other
Llama 4 Scout is a 17-billion-parameter Mixture of Experts (MoE) model released by Meta, supporting multilingual text and image understanding with a linearized expert module design for PEFT/LoRA compatibility.
Multimodal Fusion
Transformers · Supports Multiple Languages
axolotl-quants · 6,861 downloads · 3 likes
CogACT Base
MIT
CogACT is a novel Vision-Language-Action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks.
Multimodal Fusion
Transformers · English
CogACT · 6,589 downloads · 12 likes
Llama 4 Maverick 17B 128E Instruct FP8
Other
A native multimodal AI model in the Llama 4 series that supports text and image understanding, uses a mixture-of-experts architecture, and is suitable for commercial and research use.
Multimodal Fusion
Transformers · Supports Multiple Languages
RedHatAI · 5,679 downloads · 1 like
ColNomic Embed Multimodal 3B
ColNomic Embed Multimodal 3B is a 3-billion-parameter multimodal embedding model specifically designed for visual document retrieval tasks, supporting unified encoding of multilingual text and images.
Multimodal Fusion · Supports Multiple Languages
nomic-ai · 4,636 downloads · 17 likes
Llama Guard 3 11B Vision
A multimodal content-safety classifier fine-tuned from Llama-3.2-11B, optimized for detecting harmful mixed text-and-image content.
Multimodal Fusion
Transformers · Supports Multiple Languages
meta-llama · 4,553 downloads · 60 likes
DSE Qwen2 2B MRL V1
Apache-2.0
DSE-QWen2-2b-MRL-V1 is a dual-encoder model designed to encode document screenshots into dense vectors for document retrieval; a minimal ranking sketch follows this entry.
Multimodal Fusion · Supports Multiple Languages
MrLight · 4,447 downloads · 56 likes
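In contrast to multi-vector late interaction, a dual encoder such as DSE condenses each query and each document screenshot into a single dense vector and ranks candidates by cosine similarity. The model-agnostic sketch below shows only that ranking step, with placeholder vectors in place of real encoder outputs.

# Illustrative single-vector dense retrieval: one embedding per query and per
# document screenshot, ranked by cosine similarity. Placeholder vectors are
# used here; a real pipeline would obtain them from the dual encoder.
import torch
import torch.nn.functional as F

query_emb = F.normalize(torch.randn(1, 1536), dim=-1)    # 1 query vector
doc_embs = F.normalize(torch.randn(100, 1536), dim=-1)   # 100 screenshot vectors

scores = (query_emb @ doc_embs.T).squeeze(0)             # cosine similarities
top = torch.topk(scores, k=5)
print(top.indices.tolist())                              # indices of the top-5 documents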
BiomedCLIP ViT BERT HF
MIT
A BiomedCLIP implementation built on PyTorch and the Hugging Face frameworks, reproducing the original microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 model.
Multimodal Fusion
Transformers · English
chuhac · 4,437 downloads · 1 like
Ming Lite Omni
MIT
A lightweight unified multimodal model that efficiently processes images, text, audio, and video, and performs strongly in speech and image generation.
Multimodal Fusion
Transformers
inclusionAI · 4,215 downloads · 103 likes
Qwen2.5 Omni 7B GPTQ 4bit
MIT
A 4-bit GPTQ quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
Multimodal Fusion
Safetensors · Supports Multiple Languages
FunAGI · 3,957 downloads · 51 likes
Taxabind Vit B 16
MIT
TaxaBind is a multimodal embedding model that binds six modalities for ecological applications, supporting zero-shot classification of species images against taxonomic text labels; a minimal scoring sketch follows this entry.
Multimodal Fusion
MVRL · 3,672 downloads · 0 likes
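Zero-shot species classification in a shared embedding space amounts to comparing one image embedding against a set of taxonomic text embeddings and taking a softmax over the similarities. The PyTorch sketch below shows only that scoring logic; placeholder embeddings and made-up taxa stand in for TaxaBind's actual encoders and labels.

# Illustrative CLIP-style zero-shot classification over taxonomic labels.
# Placeholder embeddings stand in for the image and text encoders of a
# model like TaxaBind; only the scoring step is shown.
import torch
import torch.nn.functional as F

labels = ["Bubo bubo", "Strix aluco", "Tyto alba"]             # hypothetical taxa
image_emb = F.normalize(torch.randn(1, 512), dim=-1)           # one species photo
text_embs = F.normalize(torch.randn(len(labels), 512), dim=-1) # one vector per taxon

logits = 100.0 * image_emb @ text_embs.T   # scaled cosine similarities
probs = logits.softmax(dim=-1).squeeze(0)
print(labels[probs.argmax().item()], float(probs.max()))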
GR00T N1 2B
NVIDIA Isaac GR00T N1 is the world's first open-source foundation model for general humanoid robot reasoning and skills, with 2 billion parameters.
Multimodal Fusion
nvidia · 3,631 downloads · 284 likes
Hume System2
MIT
Hume-System2 provides the pre-trained System-2 weights of a dual-system Vision-Language-Action (VLA) model, intended to speed up System-2 training and to support robotics research and applications.
Multimodal Fusion
Transformers · English
Hume-vla · 3,225 downloads · 1 like
LLaVE 0.5B
Apache-2.0
LLaVE is a multimodal embedding model based on the LLaVA-OneVision-0.5B model, with a parameter scale of 0.5B, capable of embedding text, images, multiple images, and videos.
Multimodal Fusion
Transformers · English
zhibinlan · 2,897 downloads · 7 likes
Libero Object 1
MIT
Hume-Libero_Object is a dual-system vision-language-action model trained on the Libero-Object dataset; it incorporates System-2 reasoning and is suited to robotics research and applications.
Multimodal Fusion
Transformers · English
Hume-vla · 2,836 downloads · 0 likes
Libero Goal 1
MIT
Hume-Libero_Goal is a Vision-Language-Action model built on dual-system thinking, designed for robot tasks and integrating System-2 reasoning to improve decision making.
Multimodal Fusion
Transformers · English
Hume-vla · 2,698 downloads · 1 like
RDT 1B
MIT
A 1-billion-parameter imitation-learning diffusion Transformer pretrained on over 1M multi-robot manipulation episodes, supporting multi-view vision-language-action prediction.
Multimodal Fusion
Transformers · English
robotics-diffusion-transformer · 2,644 downloads · 80 likes
OpenVLA 7B OFT Finetuned Libero Spatial
MIT
OpenVLA-OFT is an optimized vision-language-action model that substantially improves the inference speed and task success rate of the base OpenVLA model through an optimized fine-tuning recipe.
Multimodal Fusion
Transformers
moojink · 2,513 downloads · 3 likes
Llama 4 Scout 17B 16E Unsloth Bnb 4bit
Other
Llama 4 Scout is a multimodal mixture-of-experts model developed by Meta, supporting 12 languages and image understanding, with 17 billion active parameters and a 10M context length.
Multimodal Fusion
Transformers · Supports Multiple Languages
unsloth · 2,492 downloads · 1 like
Omniembed V0.1
MIT
A multimodal embedding model based on Qwen2.5-Omni-7B, supporting unified embedding representations for cross-lingual text, images, audio, and video.
Multimodal Fusion
Tevatron · 2,190 downloads · 3 likes
Llama 4 Maverick 17B 128E Instruct FP8
Other
Llama 4 Maverick is a native multimodal AI model from Meta that uses a mixture-of-experts architecture, accepts text and image input, and outputs multilingual text and code.
Multimodal Fusion
Transformers · Supports Multiple Languages
chutesai · 2,019 downloads · 0 likes
Llama 4 Scout 17B 16E Unsloth Dynamic Bnb 4bit
Other
Llama 4 Scout is Meta's 17-billion-parameter mixture-of-experts multimodal model, supporting 12 languages and image understanding.
Multimodal Fusion
Transformers · Supports Multiple Languages
unsloth · 1,935 downloads · 2 likes
Llama 4 Scout 17B 16E Instruct INT4
Other
The Llama 4 series consists of native multimodal AI models from Meta that adopt a mixture-of-experts architecture, support text and image interaction, and perform well across a range of language and vision tasks.
Multimodal Fusion
Transformers · Supports Multiple Languages
fahadh4ilyas · 1,864 downloads · 0 likes
Llama 4 Scout 17B 16E Instruct FP8
Other
The Llama 4 series consists of native multimodal AI models from Meta that support text and image interaction, adopt a mixture-of-experts architecture, and perform well in text and image understanding.
Multimodal Fusion
Transformers · Supports Multiple Languages
fahadh4ilyas · 1,760 downloads · 0 likes
Eagle X5 13B Chat
Eagle is a family of vision-centric, high-resolution multimodal large language models, supporting input resolutions above 1K and performing strongly on tasks such as optical character recognition and document understanding.
Multimodal Fusion
Transformers
NVEagle · 1,748 downloads · 28 likes
Llama Guard 3 11B Vision
A multimodal content-safety classification model based on Llama-3.2-11B, supporting detection of harmful text/image inputs and responses.
Multimodal Fusion
Transformers · Supports Multiple Languages
SinclairSchneider · 1,725 downloads · 1 like